ENH: Support MultiIndex columns in parquet (#34777) #36305

hweecat · 2020-09-12T12:46:15Z

closes ENH: Support Multi-Index for columns in parquet format #34777
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Support MultiIndex for columns in parquet format by updating value column names check to handle MultiIndexes.

1. Update check to handle MultiIndex columns for parquet format

1. add whatsnew entry

…ndex

dsaxton

Can you also add a test?

doc/source/whatsnew/v1.2.0.rst

pandas/io/parquet.py

1. Update check to handle MultiIndex columns for parquet format
2. Edit whatsnew entry
3. Add test for writing MultiIndex columns with string column names

pandas/tests/io/test_parquet.py

1. Include issue number as a comment on added test

jreback · 2020-09-13T22:23:24Z

pandas/io/parquet.py

+            if not all(
+                x.inferred_type in {"string", "empty"} for x in df.columns.levels
+            ):
+                raise ValueError("parquet must have string column names")


can say something about 'for all values in each level of the MultiIndex'

Thanks @jreback for the suggestion on the exception statement - adding that into my next commit!
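For readers following the review, here is a minimal sketch of how the level-aware check and the more descriptive error message suggested above could fit together. The helper name `validate_parquet_columns` is hypothetical, and the exact message wording is an assumption based on the reviewer's suggestion, not necessarily what was merged.

```python
import pandas as pd


def validate_parquet_columns(df: pd.DataFrame) -> None:
    """Raise ValueError unless every column label is a string.

    Hypothetical helper for illustration; the name and the message
    wording are assumptions, not the merged pandas code.
    """
    if isinstance(df.columns, pd.MultiIndex):
        # Inspect each level on its own; an empty level is accepted too.
        if not all(
            level.inferred_type in {"string", "empty"}
            for level in df.columns.levels
        ):
            raise ValueError(
                "parquet must have string column names "
                "for all values in each level of the MultiIndex"
            )
    elif df.columns.inferred_type not in {"string", "empty"}:
        raise ValueError("parquet must have string column names")


# All labels in both levels are strings, so this frame is accepted.
string_cols = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")])
validate_parquet_columns(pd.DataFrame([[1, 2, 3]], columns=string_cols))

# The second level holds integers, so this frame is rejected.
mixed_cols = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
try:
    validate_parquet_columns(pd.DataFrame([[1, 2, 3]], columns=mixed_cols))
    rejected = False
except ValueError:
    rejected = True
```

Checking `df.columns.levels` rather than the flattened tuples keeps the test cheap, since each level stores only its unique labels.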

jreback · 2020-09-13T22:24:00Z

pandas/tests/io/test_parquet.py

        mi_columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
        df = pd.DataFrame(np.random.randn(4, 3), columns=mi_columns)
        self.check_error_on_write(df, engine, ValueError)

+    def test_write_column_multiindex_string(self, pa):
+        # GH #34777
+        # Not supported in fastparquet as of 0.1.3 or older pyarrow version


are we above the min pyarrow version?

Based on the min versions listed in pandas dependencies, the min pyarrow version is 0.15 while we are currently at 0.16 - at least for the dev environment that I'm working on.

jreback · 2020-09-13T22:24:17Z

pandas/tests/io/test_parquet.py

+            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
+            ["one", "two", "one", "two", "one", "two", "one", "two"],
+        ]
+        df = pd.DataFrame(np.random.randn(8, 8), columns=arrays)


can you add names to the MultiIndex levels. do these round trip?

After adding names to the MultiIndex levels, looks like they do round trip on pytest.

pandas/io/parquet.py

jreback · 2020-09-13T22:25:14Z

doc/source/whatsnew/v1.2.0.rst

@@ -297,6 +297,7 @@ I/O
 - :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`)
 - :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`)
 - :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`)
+- :meth:`to_parquet` did not support :class:`MultiIndex` for columns in parquet format (:issue:`34777`)


i would move this to other enhancements; say this is enabled with pyarrow=.....

Moved the whatsnew entry to "other enhancements" - thanks!

1. Add tests for writing Indexes and MultiIndexes for columns
2. Edit message for check to handle MultiIndex columns for parquet
3. Edit whatsnew entry to move entry to other enhancements

pep8speaks · 2020-09-14T09:20:06Z

Hello @hweecat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-19 01:05:10 UTC

1. Fix PEP8 issue for error message in check for MultiIndex columns

…ndex

jreback

looks fine, can you add a whatsnew note; enhancements in 1.2

does this have a min pyarrow version?

…ndex

add whatsnew entry: enhancements in 1.2

hweecat · 2020-10-11T04:04:31Z

@jreback Since the enhancements work on pyarrow 0.15.0, we could leave the min pyarrow version as >= 0.15.0 for this pull request.

I've added a whatsnew note on enhancements in 1.2 under "Other enhancements" - feel free to let me know if a more detailed whatsnew note is needed. :)

P.S. Tests are failing after I did a rebase to update my branch.

doc/source/whatsnew/v1.2.0.rst

github-actions · 2020-11-11T00:08:50Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

…ndex

alimcmaster1 · 2020-11-14T18:03:14Z

Test failures look unrelated to this PR; they appear to be related to #37818

fsspectest = <pandas.conftest.fsspectest.<locals>.TestMemoryFS object at 0x7fb2b99e5e50>
extension = 'xls'

    @pytest.mark.parametrize("extension", ["xlsx", "xls"])
    def test_excel_options(fsspectest, extension):
        df = DataFrame({"a": [0]})
    
        path = f"testmem://test/test.{extension}"
    
>       df.to_excel(path, storage_options={"test": "write"}, index=False)

pandas/tests/io/test_fsspec.py:133:

…ndex

charlesdong1991 · 2020-11-16T10:15:13Z

/azp run

azure-pipelines · 2020-11-16T10:15:24Z

Azure Pipelines successfully started running 1 pipeline(s).

charlesdong1991

CI is back green, one minor comment on whatsnew

doc/source/whatsnew/v1.2.0.rst

jreback · 2020-11-18T15:32:26Z

@hweecat small comments, can you merge master and ping on green.

…ndex

jreback · 2020-11-19T02:09:17Z

thanks @hweecat and @charlesdong1991

charlesdong1991 · 2020-11-19T09:40:09Z

all credit to @hweecat, very nice job!

mraxilus · 2021-07-23T21:26:09Z

Any reason why this explicitly raises errors for non-string multi-indices?

hweecat added 3 commits September 12, 2020 20:43

ENH: Support MultiIndex columns in parquet (pandas-dev#34777)

681ac1f

1. Update check to handle MultiIndex columns for parquet format

ENH: Support MultiIndex columns in parquet GH34777

a46e46f

1. add whatsnew entry

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

c974259

…ndex

hweecat mentioned this pull request Sep 12, 2020

TYPING: df.columns.levels raises mypy error on MultiIndex for columns #36307

Closed

dsaxton reviewed Sep 12, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

dsaxton added the Enhancement and IO Parquet (parquet, feather) labels Sep 12, 2020

dsaxton reviewed Sep 13, 2020

pandas/io/parquet.py (outdated)

ENH: Support MultiIndex columns in parquet pandas-dev#34777

e9ff779

1. Update check to handle MultiIndex columns for parquet format
2. Edit whatsnew entry
3. Add test for writing MultiIndex columns with string column names

dsaxton reviewed Sep 13, 2020

pandas/tests/io/test_parquet.py (outdated)

ENH: Support MultiIndex columns in parquet pandas-dev#34777

1b9e3f0

1. Include issue number as a comment on added test

jreback requested changes Sep 13, 2020

ENH: Support MultiIndex columns in parquet pandas-dev#34777

9e8f4eb

1. Add tests for writing Indexes and MultiIndexes for columns
2. Edit message for check to handle MultiIndex columns for parquet
3. Edit whatsnew entry to move entry to other enhancements

hweecat added 5 commits September 14, 2020 09:23

ENH: Support MultiIndex columns in parquet pandas-dev#34777

cc0f504

1. Fix PEP8 issue for error message in check for MultiIndex columns

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

3ba38fa

…ndex

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

a4131d2

…ndex

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

26966b7

…ndex

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

cc8e85c

…ndex

jreback requested changes Oct 10, 2020

jreback added this to the 1.2 milestone Oct 10, 2020

hweecat added 2 commits October 11, 2020 10:48

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

3b9b52a

…ndex

ENH: Support MultiIndex columns in parquet pandas-dev#34777

ed5fe60

add whatsnew entry: enhancements in 1.2

dsaxton reviewed Oct 11, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

fix doc failure

c859a4f

github-actions bot added the Stale label Nov 11, 2020

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

039094c

…ndex

hweecat force-pushed the io-parquet-multiindex branch from 05a2f62 to 039094c Compare November 11, 2020 03:43

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

167ae69

…ndex

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

2e4fc58

…ndex

charlesdong1991 removed the Stale label Nov 16, 2020

charlesdong1991 approved these changes Nov 16, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

charlesdong1991 and others added 2 commits November 18, 2020 16:37

Update doc/source/whatsnew/v1.2.0.rst

180ddff

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

234009b

…ndex

charlesdong1991 mentioned this pull request Nov 18, 2020

CI: Fix ci flake error #37938

Merged

1 task

Merge remote-tracking branch 'upstream/master' into io-parquet-multii…

ab24628

…ndex

jreback approved these changes Nov 19, 2020

jreback merged commit 5f2bac5 into pandas-dev:master Nov 19, 2020

bluss mentioned this pull request Oct 30, 2021

Fastparquet to support column MultiIndex dask/fastparquet#616

Closed
